MaxViT: Multi-Axis Vision Transformer
Transformers have recently gained significant attention in the computer
vision community. However, the lack of scalability of self-attention mechanisms
with respect to image size has limited their wide adoption in state-of-the-art
vision backbones. In this paper we introduce an efficient and scalable
attention model we call multi-axis attention, which consists of two aspects:
blocked local and dilated global attention. These design choices allow
global-local spatial interactions on arbitrary input resolutions with only
linear complexity. We also present a new architectural element by effectively
blending our proposed attention model with convolutions, and accordingly
propose a simple hierarchical vision backbone, dubbed MaxViT, by simply
repeating the basic building block over multiple stages. Notably, MaxViT is
able to "see" globally throughout the entire network, even in earlier,
high-resolution stages. We demonstrate the effectiveness of our model on a
broad spectrum of vision tasks. On image classification, MaxViT achieves
state-of-the-art performance under various settings: without extra data, MaxViT
attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our
model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a
backbone delivers favorable performance on object detection as well as visual
aesthetic assessment. We also show that our proposed model expresses strong
generative modeling capability on ImageNet, demonstrating the superior
potential of MaxViT blocks as a universal vision module. We will make the code
and models publicly available.
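
As a concrete illustration of the two attention axes described above, the short PyTorch sketch below shows the block (local window) and grid (dilated global) partitions that multi-axis attention alternates between. The function names, the window/grid size of 4, and the use of a stock MultiheadAttention layer are illustrative assumptions rather than the released MaxViT implementation.

    import torch

    def block_partition(x, p):
        # x: (B, H, W, C) -> (num_windows, p*p, C); attention inside each
        # p x p window gives the blocked *local* interaction.
        b, h, w, c = x.shape
        x = x.view(b, h // p, p, w // p, p, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, c)

    def grid_partition(x, g):
        # x: (B, H, W, C) -> (num_groups, g*g, C); each group holds g*g
        # tokens sampled on a strided g x g grid spanning the whole map,
        # giving the dilated *global* interaction.
        b, h, w, c = x.shape
        x = x.view(b, g, h // g, g, w // g, c)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, c)

    # Toy usage: both partitions keep the attention cost linear in H * W.
    attn = torch.nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
    x = torch.randn(2, 16, 16, 32)            # (B, H, W, C) feature map
    local_tokens = block_partition(x, p=4)    # 4 x 4 neighbourhood windows
    global_tokens = grid_partition(x, g=4)    # 4 x 4 strided global grid
    y_local, _ = attn(local_tokens, local_tokens, local_tokens)
    y_global, _ = attn(global_tokens, global_tokens, global_tokens)
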
Quality prediction and visual enhancement of user-generated content
With the rapid development of streaming media technologies and the explosion of user-generated content (UGC) captured and streamed over social media platforms such as YouTube and Facebook, videos now play a central role in the daily lives of billions of people. The growing popularity of UGC videos has created a great need to understand and analyze these billions of shared contents in order to optimize pipelines for efficient UGC video storage, processing, and streaming. UGC videos, typically created by amateur videographers, often suffer from unsatisfactory perceptual quality arising from any stage of the video acquisition and production process. Predicting UGC video quality is therefore much more challenging than assessing the quality of the synthetically distorted videos found in traditional video quality databases. In this dissertation, we comprehensively investigate the quality prediction and enhancement problems for UGC pictures and videos.

We first study a particular artifact, the "banding artifact," a common video compression impairment. We begin by analyzing the perceptual and encoding aspects of color bands, and then build a new distortion-specific no-reference quality metric dedicated to banding visibility. We further develop a banding artifact removal algorithm by formulating it as a visual enhancement problem, which we solve with a content-adaptive smoothing filter followed by dithered quantization, applied as a post-processing module. We also extend this debanding filter by learning a cascaded artifact removal network that jointly removes banding and blocking artifacts, yielding greater visual enhancement.

UGC distortions are diverse, complicated, and commingled, so no single quality factor suffices to predict overall quality, and blindly predicting the perceptual quality of UGC videos is very challenging. We first conduct a benchmark study of leading no-reference video quality metrics on recent large-scale UGC video databases, and then leverage feature selection to build a new compact video quality model, which we dub VIDEVAL, on top of a curated set of effective spatial and temporal features from popular VQA models. In addition to this compact model, we build an efficiency-oriented model for practical use, called RAPIQUE, by combining efficient natural scene statistics features with pre-trained deep learning models; RAPIQUE employs an aggressive spatial and temporal sampling process to boost its efficiency. Along the way, we also explore the temporal statistics of natural videos, which helps push forward the performance of VQA models on motion-intensive videos with large camera motion.

Next, we study visual restoration and enhancement of pictures degraded by distortions commonly found in UGC videos, including noise, blur, and low light. Building on recent progress in Transformer and multi-layer perceptron (MLP) models, we propose an efficient MLP-based vision backbone, which we dub MAXIM, that can effectively restore images suffering from degradation. The core component of MAXIM is the multi-axis gated MLP block, which achieves local and global spatial interactions with linear complexity. We further extend this idea to high-level vision tasks such as image recognition by proposing another vision backbone called MaxViT. Our extensive numerical and visual experiments show that this multi-axis approach provides a strong vision component for both high-level and low-level vision tasks. Finally, we conclude the thesis with remarks on current challenges and future directions for UGC video quality prediction and enhancement.
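
The debanding step described above (a content-adaptive smoothing filter followed by dithered quantization) can be sketched in a few lines. The code below is an illustrative stand-in, not the dissertation's actual filter: the gradient-based flatness test, the thresholds, and the uniform dither are all assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def deband(luma_u8, sigma=2.0, flat_thresh=2.0, dither_amp=0.5):
        # luma_u8: 2-D uint8 luma plane of a decoded frame.
        img = luma_u8.astype(np.float32)
        smooth = gaussian_filter(img, sigma=sigma)
        # Content adaptivity (simplified): only replace pixels in low-gradient
        # (flat) regions, where banding is visible; keep edges and textures.
        gy, gx = np.gradient(img)
        out = np.where(np.hypot(gx, gy) < flat_thresh, smooth, img)
        # Dithered re-quantization to 8 bits breaks up residual contours.
        out = out + np.random.uniform(-dither_amp, dither_amp, out.shape)
        return np.clip(np.round(out), 0, 255).astype(np.uint8)
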
RAPIQUE: Rapid and Accurate Video Quality Prediction of User Generated Content
Blind or no-reference video quality assessment of user-generated content
(UGC) has become a trending, challenging, heretofore unsolved problem. Accurate
and efficient video quality predictors suitable for this content are thus in
great demand to achieve more intelligent analysis and processing of UGC videos.
Previous studies have shown that natural scene statistics and deep learning
features are both sufficient to capture spatial distortions, which contribute
to a significant aspect of UGC video quality issues. However, these models are
either incapable or inefficient for predicting the quality of complex and
diverse UGC videos in practical applications. Here we introduce an effective
and efficient video quality model for UGC content, which we dub the Rapid and
Accurate Video Quality Evaluator (RAPIQUE), which we show performs comparably
to state-of-the-art (SOTA) models but with orders-of-magnitude faster runtime.
RAPIQUE combines and leverages the advantages of both quality-aware scene
statistics features and semantics-aware deep convolutional features, allowing
us to design the first general and efficient spatial and temporal (space-time)
bandpass statistics model for video quality modeling. Our experimental results
on recent large-scale UGC video quality databases show that RAPIQUE delivers
top performances on all the datasets at a considerably lower computational
expense. We hope this work promotes and inspires further efforts towards
practical modeling of video quality problems for potential real-time and
low-latency applications. To promote public usage, an implementation of RAPIQUE
has been made freely available online: https://github.com/vztu/RAPIQUE.
Comment: IEEE Open Journal of Signal Processing, 2021
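
As a rough sketch of the recipe described above, the code below combines a few bandpass (MSCN) scene-statistics summaries with features from a frozen pre-trained ResNet-50 over sparsely sampled frames, then fits a regressor on quality labels. The exact features, sampling rate, and SVR regressor here are assumptions for illustration; the released implementation at the URL above is the reference.

    import numpy as np
    import torch
    from scipy.ndimage import gaussian_filter
    from sklearn.svm import SVR
    from torchvision.models import resnet50

    def mscn_stats(gray):
        # Mean-subtracted contrast-normalized (MSCN) coefficients and simple
        # first/second-order summaries: a classic bandpass NSS quality cue.
        mu = gaussian_filter(gray, 7.0 / 6.0)
        sigma = np.sqrt(np.abs(gaussian_filter(gray * gray, 7.0 / 6.0) - mu * mu))
        mscn = (gray - mu) / (sigma + 1.0)
        return np.array([mscn.mean(), mscn.std(), sigma.mean(), sigma.std()])

    cnn = resnet50(weights="DEFAULT").eval()
    backbone = torch.nn.Sequential(*list(cnn.children())[:-1])   # drop classifier

    def video_features(frames_rgb):
        # frames_rgb: list of H x W x 3 uint8 frames from a decoded video.
        feats = []
        for f in frames_rgb[::8]:                  # aggressive temporal sampling
            gray = f.mean(axis=2).astype(np.float32)
            x = torch.from_numpy(f).permute(2, 0, 1).float()[None] / 255.0
            with torch.no_grad():                  # ImageNet normalization omitted
                deep = backbone(x).flatten().numpy()   # 2048-d semantic features
            feats.append(np.concatenate([mscn_stats(gray), deep]))
        return np.mean(feats, axis=0)              # average-pool over sampled frames

    # Training: fit a regressor on features of MOS-labelled UGC videos, e.g.
    # reg = SVR().fit(np.stack([video_features(v) for v in train_videos]), train_mos)
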
CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers
Bird's eye view (BEV) semantic segmentation plays a crucial role in spatial
sensing for autonomous driving. Although the recent literature has made
significant progress on BEV map understanding, existing methods are all based
on single-agent camera systems, which struggle to handle occlusions and to
detect distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V)
communication technologies have enabled autonomous vehicles to share sensing
information, which can dramatically improve the perception performance and
range as compared to single-agent systems. In this paper, we propose CoBEVT,
the first generic multi-agent multi-camera perception framework that can
cooperatively generate BEV map predictions. To efficiently fuse camera features
from multi-view and multi-agent data in an underlying Transformer architecture,
we design a fused axial attention (FAX) module, which captures sparse local
and global spatial interactions across views and agents. Extensive
experiments on the V2V perception dataset, OPV2V, demonstrate that CoBEVT
achieves state-of-the-art performance for cooperative BEV semantic
segmentation. Moreover, CoBEVT is shown to be generalizable to other tasks,
including 1) BEV segmentation with single-agent multi-camera and 2) 3D object
detection with multi-agent LiDAR systems, and achieves state-of-the-art
performance with real-time inference speed.
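
To illustrate the fused axial attention idea, the sketch below lets tokens from all agents and views attend jointly within each local spatial window, so information is exchanged across both space and agents; a strided global grid branch would be built the same way. The class name FAXSketch, the shapes, and the single-branch simplification are assumptions, not the paper's FAX module.

    import torch

    class FAXSketch(torch.nn.Module):
        def __init__(self, dim=64, heads=4, window=4):
            super().__init__()
            self.window = window
            self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):
            # x: (B, N_agents, H, W, C) BEV features gathered from every agent.
            b, n, h, w, c = x.shape
            p = self.window
            # Local branch: within each p x p spatial window, tokens from all
            # agents/views attend jointly, fusing information across agents.
            t = x.view(b, n, h // p, p, w // p, p, c)
            t = t.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, n * p * p, c)
            t, _ = self.attn(t, t, t)
            t = t.reshape(b, h // p, w // p, n, p, p, c)
            t = t.permute(0, 3, 1, 4, 2, 5, 6).reshape(b, n, h, w, c)
            return t   # a strided global (grid) branch would follow analogously

    x = torch.randn(2, 3, 16, 16, 64)    # 2 scenes, 3 cooperating agents
    print(FAXSketch()(x).shape)          # torch.Size([2, 3, 16, 16, 64])
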
ROMNet: Renovate the Old Memories
Renovating the memories in old photos is an intriguing research topic in
computer vision fields. These legacy images often suffer from severe and
commingled degradations such as cracks, noise, and color fading, while the lack of
large-scale paired old photo datasets makes this restoration task very
challenging. In this work, we present a novel reference-based end-to-end
learning framework that can jointly repair and colorize the degraded legacy
pictures. Specifically, the proposed framework consists of three modules: a
restoration sub-network for degradation restoration, a similarity sub-network
for color histogram matching and transfer, and a colorization subnet that
learns to predict the chroma elements of the images conditioned on chromatic
reference signals. The whole system takes advantage of the color histogram
priors in a given reference image, which vastly reduces the dependency on
large-scale training data. Apart from the proposed method, we also create, to
our knowledge, the first public and real-world old photo dataset with paired
ground truth for evaluating old photo restoration models, wherein each old
photo is paired with a pristine image manually restored by Photoshop experts.
Our extensive experiments conducted on both synthetic and real-world datasets
demonstrate that our method significantly outperforms the state of the art both
quantitatively and qualitatively.
Comment: Paper major revision
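
To illustrate the reference color-histogram prior described above, the sketch below performs a classical, non-learned histogram matching of the chroma channels from a reference photo onto an old photo in Lab space. This is only a stand-in for the learned similarity and colorization sub-networks; the function name and the use of scikit-image's match_histograms are assumptions.

    import numpy as np
    from skimage import color
    from skimage.exposure import match_histograms

    def transfer_reference_color(old_rgb, ref_rgb):
        # Both inputs: float RGB arrays in [0, 1]. Work in Lab space: keep the
        # old photo's luminance (structure), borrow the chroma distribution
        # from the reference via per-channel histogram matching.
        old_lab = color.rgb2lab(old_rgb)
        ref_lab = color.rgb2lab(ref_rgb)
        matched_ab = match_histograms(old_lab[..., 1:], ref_lab[..., 1:],
                                      channel_axis=-1)
        out_lab = np.concatenate([old_lab[..., :1], matched_ab], axis=-1)
        return np.clip(color.lab2rgb(out_lab), 0.0, 1.0)

    # Usage: colored = transfer_reference_color(restored_old_photo, reference_photo)
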